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Abstract —  This  paper  presents  an  architecture  and  a  synthesis 
method  for  programmable  numerical  function  generators  (NFGs) 
for  trigonometric,  logarithmic,  square  root,  and  reciprocal  func¬ 
tions.  Our  NFG  partitions  a  given  domain  of  the  function  into 
non-uniform  segments  using  an  LUT  cascade,  and  approximates 
the  given  function  by  a  quadratic  polynomial  for  each  segment. 
Thus,  we  can  implement  fast  and  compact  NFGs  for  a  wide  range 
of  functions.  Implementation  results  on  an  FPGA  show  that: 
1)  our  NFGs  require  only  4%  of  the  memory  needed  by  NFGs 
based  on  the  linear  approximation  with  non-uniform  segmenta¬ 
tion;  and  2)  our  NFGs  require  only  22%  of  the  memory  needed 
by  NFGs  based  on  the  5th-order  approximation  with  uniform  seg¬ 
mentation.  Our  automatic  synthesis  system  generates  such  com¬ 
pact  NFGs  quickly. 

I.  Introduction 

Numerical  function  generators  (NFGs)  are  often  used  in 
computer  graphics,  digital  signal  processing,  communication 
systems,  robotics,  astrophysics,  fluid  physics,  etc.  The  func¬ 
tions  realized  include  trigonometric,  logarithmic,  square  root, 
and  reciprocal  functions.  High-performance  CPUs  usually 
have  numerical  coprocessors.  However,  embedded  CPUs  and 
CPUs  on  FPGAs  do  not  have  such  coprocessors.  Thus,  FPGA 
implementation  of  numerical  functions  f(x)  is  needed.  Im¬ 
plementation  by  a  single  lookup  table  for  fix)  is  simple  and 
fast.  For  low-precision  computations  of  fix)  (e.g.  x  and  fix) 
have  8  bits),  this  implementation  is  straightforward.  For  high- 
precision  computations,  however,  the  single  lookup  table  im¬ 
plementation  is  impractical  due  to  the  huge  table  size.  For 
such  applications,  the  CORDIC  (Coordinate  Rotation  Digi¬ 
tal  Computer)  algorithm  [1,21]  has  been  often  used.  Although 
CORDIC  is  implemented  with  compact  hardware,  it  is  itera¬ 
tive  and  therefore  slow.  For  numerically  intensive  applications, 
faster  evaluation  of  numerical  function  is  required. 

For  fast  evaluation  of  numerical  functions,  polynomial  ap¬ 
proximations  have  been  used  [9,  10,  19,  20].  These  meth¬ 
ods  approximate  the  given  numerical  functions  by  piecewise 
polynomials,  and  realize  the  polynomials  with  hardware.  Fin- 
ear  or  quadratic  approximations  offer  fast  and  relatively  high- 
precision  evaluation  of  numerical  functions.  However,  the 
methods  proposed  so  far  are  ad-hoc  and  not  systematic.  This 
paper  proposes  an  architecture  and  a  systematic  synthesis 
method  for  NFGs  based  on  quadratic  approximation.  By  using 
the  FUT  cascade  [8],  many  numerical  functions  are  efficiently 
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Fig.  1 .  Synthesis  flow  for  NFGs. 


approximated  by  piecewise  quadratic  functions.  Our  synthesis 
method  can  be  automated,  so  that  fast  and  compact  NFGs  can 
be  produced  by  non-experts.  Fig.  1  shows  the  synthesis  flow 
for  the  NFG.  It  converts  the  Design  Specification  described  by 
Scilab  [18],  a  MATFAB-like  software,  into  HDF  code.  The 
Design  Specification  consists  of  a  function  fix),  a  domain  for 
x,  and  an  acceptable  error.  This  system  first  partitions  the  do¬ 
main  into  segments,  and  then  approximates  f(x)  by  a  quadratic 
function  for  each  segment.  Next,  it  analyzes  the  errors,  and  de¬ 
rives  the  necessary  precision  for  computing  units  in  the  NFG. 
Then,  it  generates  HDF  code  to  be  mapped  into  an  FPGA  us¬ 
ing  an  FPGA  vendor  tool.  Due  to  the  page  limitation,  the  error 
analysis  for  our  NFGs  is  omitted  here,  but  it  is  available  in 
[14].  This  paper  extends  [17]  to  quadratic  approximations. 

II.  Preliminaries 

Definition  2.1  The  binary  fixed-point  representation  of  a 

value  r  has  the  form 

dnjnt—  1  dfijnt—  2  ...  d\  do.  d-i  d—  2  . . .  d~n_frac,  (1) 

where  dj  £  {0, 1},  nJnt  is  the  number  of  bits  for  the  integer 
part,  and  n_frac  is  the  number  of  bits  for  the  fractional  part  of 
r.  The  representation  in  (1)  is  two’s  complement,  and  so 

nJnt—2 

r=-2nJnt-1dnjnt-i+  X  2‘d'- 

i=—n_frac 

Definition  2.2  Error  is  the  absolute  difference  between  the 
original  value  and  the  approximated  value.  Approximation 
error  is  the  error  caused  by  a  function  approximation,  and 
rounding  error  is  the  error  caused  by  a  binary  fixed-point  rep¬ 
resentation.  Acceptable  error  is  the  maximum  error  that  an 
NFG  may  assume.  Acceptable  approximation  error  (AAE)  is 
the  maximum  approximation  error  that  a  function  approxima¬ 
tion  may  assume. 
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Definition  2.3  Precision  is  the  total  number  of  bits  for  a  bi¬ 
nary  fixed-point  representation.  Specially,  n-bit  precision 
specifies  that  n  bits  are  used  to  represent  the  number;  that  is, 
n  =  nJnt+U-frac.  An  n-bit  precision  NFG  has  an  n-bit  input. 

Definition  2.4  Accuracy  is  the  number  of  bits  in  the  fractional 
part  of  a  binary  fixed-point  representation.  Specially,  m-bit  ac¬ 
curacy  specifies  that  m  bits  are  used  to  represent  the  fractional 
part  of  the  number;  that  is,  m  =  n_frac.  An  m-bit  accuracy 
NFG  is  an  NFG  with  m-bit  fractional  part  of  the  input,  m-bit 
fractional  part  of  the  output,  and  a  2~m  acceptable  error. 

III.  Quadratic  Approximation  Algorithm 

To  approximate  the  numerical  function  f(x)  using  quadratic 
functions,  first,  we  partition  the  domain  for  x  into  segments. 
For  each  segment,  we  approximate  f(x)  using  a  quadratic 
function  g(x)  =  c 2X2  +  c\x+  cq.  In  this  case,  the  approxima¬ 
tion  error  depends  on  the  segmentation  method  and  the  values 
of  coefficients  C2,  c \ ,  and  c o  in  the  approximation  polynomial. 

For  piecewise  polynomial  approximations,  in  many  cases, 
the  domain  is  partitioned  into  uniform  segments  [2,  6,  19]. 
Such  methods  are  simple  and  fast,  but  for  some  kinds  of  nu¬ 
merical  functions,  too  many  segments  are  required,  resulting 
in  large  memory. 

For  a  given  error,  non-uniform  segmentation  of  the  domain 
uses  fewer  segments  than  the  uniform  segmentation  [9,  17], 
However,  a  non-uniform  segmentation  often  requires  a  com¬ 
plicated  segment  index  encoder  (see  Section  IV),  and  results 
in  larger  and  slower  NFGs.  To  overcome  this  problem,  a  spe¬ 
cial  non-uniform  segmentation  has  been  proposed  [9].  This 
method  produces  a  simple  segment  index  encoder  by  restrict¬ 
ing  the  segmentation  points,  and  results  in  fewer  segments  as 
well  as  faster  and  more  compact  NFGs  than  produced  by  uni¬ 
form  segmentation.  However,  it  is  ad-hoc  and  non-optimum 
for  the  given  function.  Our  NFG  can  implement  any  non- 
uniform  segmentation  with  a  fast  and  compact  segment  index 
encoder  by  using  an  LUT  cascade  [  17]  with  a  synthesis  method 
that  can  be  automated. 

Selection  of  the  approximation  polynomial  influences  the 
number  of  non-uniform  segments  as  well  as  the  approximation 
error.  In  this  paper,  we  use  the  2nd-order  Chebyshev  approxi¬ 
mation  to  approximate  f(x)  with  fewer  non-uniform  segments, 
and  compute  the  approximated  value.  Since  coefficients  of  the 
Chebyshev  approximation  polynomial  are  easily  computed,  it 
is  suitable  for  automatic  synthesis. 

A.  Segmentation  Algorithm 

For  a  segment  j'.v,  e]  of  fix),  the  maximum  approximation 
error  £2(s,e)  of  the  2nd-order  Chebyshev  approximation  [11] 
is  given  by 

82  ^  ^  =  ^192  ^(3)  ^  I  ’  (2) 

where  is  the  3rd-order  derivative  of  /.  From  (2),  82  (s,e) 
is  a  monotone  increasing  function  of  segment  width  e  —  s. 
Using  this  property,  we  partition  a  domain  into  as  wide  seg¬ 
ments  as  possible  such  that  the  approximation  error  is  less 


Input:  Numerical  function  fix),  Domain  [a,b\  for  x. 

Acceptable  approximation  error  8. 

Output:  Segments  [50,e0],  [si,ei], . . . ,  [st-i,et-i]. 

Process: 

1 .  Let  sq  =  a  and  i  =  0. 

2.  Find  a  value  p  (>  sf)  where  82 {st,p)  =  8. 

3.  If  p  >  b,  then  let  p  =  b. 

4.  Let  e,  =  p  and  i  =  i  +  1. 

5.  If  p  =  b,  then  let  t  =  i,  and  stop  the  process. 

6.  Else,  let  s /  =  p,  and  go  to  step  2. 


Fig.  2.  Non-uniform  segmentation  algorithm  for  the  domain. 


than  the  specified  error.  Fig.  2  shows  the  non-uniform  seg¬ 
mentation  algorithm.  The  inputs  for  this  algorithm  are  a  nu¬ 
merical  function  f(x),  a  domain  [a,  b\  for  x,  and  an  accept¬ 
able  approximation  error  8.  Then,  this  algorithm  approximates 
f{x )  with  the  acceptable  approximation  error  8,  and  produces 
t  segments  [s0,eo],  [sr,ei], . . .,  [5,-1, e,_i].  For  step  2  in  Fig.  2, 
the  accurate  computation  of  the  value  p  where  82 (s;,p)  =  8 
is  difficult.  Thus,  we  obtain  the  maximum  value  p'  satisfy¬ 
ing  82 isi,p')  <  8.  Such  p’  can  be  found  by  scanning  values  of 
mbit  input  x.  However,  it  requires  0(2")  search,  and  is  time- 
consuming.  Therefore,  we  compute  the  maximum  value  p’  by 
setting  0  or  1  from  MSB  to  LSB  of  x  such  that  82  isi,P')  <  e. 
This  requires  0(«)  search.  In  the  computation  of  82 (v,.//),  the 
value  of  maXj.<j<„i  |  (jc)  |  is  computed  by  the  nonlinear  pro¬ 
gramming  algorithm,  which  is  one  of  the  most  efficient  [7], 

B.  Computation  of  Approximate  value 

For  each  [s/,e/],  f(x)  is  approximated  by  the  corresponding 
quadratic  function  gi  (x) .  That  is,  the  approximated  value  y  of 
fix)  is  computed  as  follows: 


y  =  giix)  =  c2,x2  +  cux  +  co/,  (3) 

where  the  coefficients  C2/,  ci/,  and  co/  are  derived  from  the  2nd- 
order  Chebyshev  approximation  polynomial  [11].  Substituting 
x  —  qi  +  <27  for  x  in  (3)  yields  the  transformation 

giix)  =  C2iix-q;)2 +  icu  +  2c2iqi)ix-qi) 

+coi  +  cuqi  +  C2iqf  ■  (4) 

In  (4),  let  c'u  =  ci/  +  2c2,q,  and  c'0j  =  c0/  +  cuqt  +  c2tq2.  Then, 
we  have 


giix)  =  c2/0  -  qi)2  +  du{x  -  qi)  +  Cq/.  (5) 

This  transformation  reduces  the  multiplier  size. 

IV.  Architecture  for  NFGs 

Fig.  3  shows  the  architecture  that  realizes  (5).  It  uses  7  units: 
the  segment  index  encoder  that  computes  the  index  i  for  seg¬ 
ment  [s/,e/j  including  the  input  value  x;  the  coefficients  table 
for  —qt,  C2i,  c\j,  and  c'Qj\  the  adder  forx+  (— <2/);  the  squaring 
unit;  two  multipliers;  and  the  final  adder. 
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4)  Adder:  by  logic  elements.  Our  synthesis  system  derives  the 
appropriate  bit-width  for  each  component  by  automatic  error 
analysis. 


A.  Size  Reduction  of  Multiplier 


Fig.  3.  Architecture  forNFGs. 


Fig.  4.  Segment  index  encoder. 


A  segment  index  encoder  converts  x  into  a  segment  in¬ 
dex  i.  It  realizes  the  segment  index  function  seg-func(x)  : 
B”  — >  {0,  l,...,f  —  1}  shown  in  Fig.  4  (a),  where  x  has  n  bits, 
B  =  {0, 1},  and  f  denotes  the  number  of  segments.  In  [9],  to 
simplify  the  segment  index  encoder,  the  values  of  Sj  and  e\  are 
restricted  to  what  can  be  produced  by  a  simple  combinational 
logic  circuit.  Such  a  segmentation  method  results  in  many  seg¬ 
ments  since  it  does  not  adapt  to  the  given  function.  Our  syn¬ 
thesis  system  uses  the  LUT  cascade  [8.  15,  16]  shown  in  Fig.  4 
(b)  to  realize  arbitrary  seg-func(x).  It  can  be  designed  by  func¬ 
tional  decomposition  using  BDDs  (Binary  Decision  Diagrams) 
representing  seg_func(x).  Our  synthesis  system  uses  a  nonre- 
strictive  segmentation.  It  is  suitable  for  automatic  synthesis. 
In  LUT  cascades,  the  interconnecting  lines  between  adjacent 
LUTs  are  called  rails.  The  size  of  an  LUT  cascade  depends  on 
the  number  of  rails.  The  next  theorem  shows  that  the  segment 
index  functions  are  realized  by  compact  LUT  cascades. 

Theorem  4.1  [16]  Let  seg-func{x )  be  a  segment  index  func¬ 
tion  with  t  segments.  Then,  there  exists  an  LUT  cascade  for 
seg-func(x )  with  at  most  |"log2f]  rails. 

Our  synthesis  system  uses  heterogeneous  MDDs  (Multi-valued 
Decision  Diagrams)  [13]  to  find  compact  LUT  cascades.  Since 
the  LUT  cascade  is  suitable  for  the  pipeline  processing,  it  of¬ 
fers  a  fast  and  compact  circuit.  In  Section  VI,  we  will  show 
that  our  architecture  produces  fast  and  compact  NFGs  for  var¬ 
ious  numerical  functions. 


Although  modern  FPGAs  have  dedicated  multipliers,  large 
multipliers  are  slow.  In  our  architecture,  the  multiplier  often 
has  the  longest  delay  time  among  all  the  units.  Thus,  to  imple¬ 
ment  a  fast  NFG,  reducing  multiplier  size  is  important.  Since 
the  size  of  multipliers  depends  on  the  number  of  bits  for  C2i, 
c'u,  and  x—  qj,  it  is  important  to  reduce  the  number  of  bits  to 
represent  these  values. 

First,  we  consider  the  case  where  the  absolute  values  of 
C2i  and  c'u  are  large.  Our  synthesis  method  uses  a  scaling 
method  [9].  We  represent  C2i  and  c'u  as  C2i  =  C2i  x  2“,2/  x  2'2i 
and  c'u  =  c'u  x  2~lli  x  2l,i,  respectively.  That  is,  instead  of  the 
original  values  of  C2i  and  c'u,  we  store  the  values  of  C2i  x  2~lli, 
hi,  c'u  x  2~lli,  and  In  in  the  coefficients  table.  In  this  case, 
the  products  c2; (x  —  qj)2  and  c\ i (x  —  qj)  are  computed  using 
multipliers  and  shifters.  The  use  of  hi  and  In  reduces  the  num¬ 
ber  of  bits  to  represent  the  values  of  C2i  x  2~l2i  and  c'u  x  2~lli, 
but  increases  the  rounding  errors.  Our  synthesis  method  finds 
optimum  values  of  hi  and  In  for  each  segment  such  that  an 
acceptable  error  is  achieved.  When  hi  and  In  are  0  for  all  the 
segments,  no  shifter  is  implemented,  that  is,  C2i{x  —  qj)2  and 
c'u(x—  q[)  are  directly  implemented  with  multipliers. 

Next,  we  consider  the  value  of  x  —  qj.  The  number  of  bits 
for  x  —  qj  influences  the  sizes  of  the  squaring  unit  and  mul¬ 
tipliers.  Thus,  reducing  the  value  of  x  —  qj  reduces  the  sizes 
of  the  squaring  unit  and  multipliers,  and  also  the  error.  From 
(5),  we  can  choose  any  value  for  qj.  To  reduce  the  value  of 
x  —  qj ,  for  a  segment  [sj, ej\,  we  set  qj  =  ( Sj  +  e ,) / 2.  Then,  we 
have  \x  —  qi\  <  (e,-  —  Sj)/2.  Thus,  reducing  the  segment  width 
ej  —  Sj  reduces  the  value  for  x—  cp.  However,  this  also  increases 
the  number  of  segments,  and  results  in  increased  memory  size. 
The  rest  of  this  section  shows  a  reduction  method  of  segment 
width  without  increasing  the  memory  size. 

The  coefficients  table  in  Fig.  3  has  2k  words,  where  k  = 
|~log2f]  and  t  is  the  number  of  segments.  Therefore,  we  can 
increase  the  number  of  segments  up  to  t  =  2k  without  increas¬ 
ing  the  memory  size.  From  Theorem  4.1,  the  size  of  LUT  cas¬ 
cade  also  depends  on  the  value  of  k.  However,  increasing  the 
number  of  segments  to  t  =  2k  seldom  increases  the  size  of  the 
LUT  cascade.  We  reduce  the  size  of  segments  by  dividing  the 
largest  segment  into  two  equal  sized  segments  up  to  t  =  2k. 
This  method  reduces  both  the  number  of  bits  for  x  —  qj  and  the 
error  without  increasing  the  memory  size. 


LUT 

LUT 

LUT 

— 

— 

(b)  LUT  cascade. 
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so  <x<eo 

0 

s\  <x<e\ 

1 

Sr-i  <x<e,-i 

(-1 

(a)  Segment  index  function. 


V.  Implementation  with  FPGA 

Modern  FPGAs  consist  of  logic  elements  (LEs)  or  config¬ 
urable  logic  blocks  (CLBs),  synchronous  memory  blocks,  mul¬ 
tipliers  (DSP  units),  etc.  Our  synthesis  system  efficiently  gen¬ 
erates  NFGs  using  these  components.  Each  unit  for  the  NFG 
shown  in  Fig.  3  is  implemented  by  the  following  components 
in  an  FPGA:  1)  Segment  index  encoder  (LUT  cascade)  and 
coefficients  table:  by  synchronous  memory  blocks;  2)  Squar¬ 
ing  unit:  by  logic  elements;  3)  Multiplier:  by  DSP  units;  and 


B.  Pipeline  Processing 

To  implement  a  high-throughput  NFG  in  an  FPGA,  our  syn¬ 
thesis  system  inserts  pipeline  registers  between  all  units  in  the 
architecture.  Since  all  units  operate  in  parallel,  and  each  unit 
has  a  short  delay  time,  our  NFGs  achieves  high  throughput. 
Table  I  shows  the  units  and  the  number  of  pipeline  stages  for 
them.  Our  NFGs  have  n_cas  +  (5  or  6)  pipeline  stages,  where 
n_cas  is  the  number  of  LUTs  for  the  LUT  cascade. 
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TABLE  II 


Number  of  segments  for  various  approximation  methods. 


Function 

/« 

Domain 

AAE  = 

F17 

AAE  = 

F75 

Linear 

Non 

2nd-Cheb 

Uniform 

yshev 

Non 

Time 

[msecl 

Linear 

Non 

2nd-Cheb 

Uniform 

yshev 

Non 

Time 

[msecl 

2X 

[0. 1] 

128 

9 

7 

0.1 

2048 

65 

44 

70 

l/x 

[1.2) 

124 

16 

11 

0.1 

1982 

128 

64 

60 

yjx 

[1/32,  2) 

193 

252 

24 

10 

3082 

2016 

138 

150 

1/vT 

[1.2) 

46 

16 

8 

0.1 

1024 

128 

46 

50 

log,  (x) 

[1.2) 

128 

16 

10 

10 

2048 

128 

56 

70 

ln(x) 

[1,2) 

89 

16 

9 

10 

1437 

128 

50 

50 

sin(7u:) 

[0.  1/2) 

127 

17 

12 

10 

2027 

129 

74 

90 

cos(7u:) 

[0.  1/2) 

127 

17 

12 

10 

2027 

129 

74 

90 

tan(7ur) 

[0.  1/4) 

112 

33 

12 

10 

1787 

129 

73 

110 

vFM*) 

[1/32,  1) 

354 

31744 

52 

70 

5933 

8126464 

331 

720 

tan2  (jix)  + 1 

[0,  1/4) 

256 

33 

17 

20 

4096 

257 

101 

170 

Entropy 

[1/256,  255/256] 

520 

509 

40 

30 

8320 

4065 

234 

300 

Sigmoid 

[0.  1] 

127 

33 

13 

20 

2020 

129 

76 

160 

Gaussian 

10,  1/21 

32 

5 

4 

0.1 

512 

33 

18 

30 

Average 

170 

2337 

17 

20 

2739 

580995 

99 

100 

AAE:  Acceptable  Approximation  Error. 
Linear:  Linear  approximation. 

Uniform:  Uniform  segmentation. 
Experiment  environment: 


Time:  CPU  time  for  our  non-uniform  segmentation  algorithm. 
2nd-Chebyshev:  2nd-order  Chebyshev  approximation. 

Non:  Non-uniform  segmentation. 

CPU:  Pentium4  Xeon  2.8GHz  Memory:  4GB 


OS:  Redhat  (Linux  7.3) 


C  compiler:  gee  -02 


TABLE I 

Number  of  pipeline  stages  for  NFGs. 


Name  of  units 

Pipeline  stages 

1 .  Segment  index  encoder 

njoas 

2.  Coefficients  table 

i 

3.  Adder  for.r-|-  (—qt) 

i 

4.  Squaring  unit 

i 

5.  Multipliers  (parallel) 

i 

6.  Shifter  (optional) 

0  or  1 

7.  Final  adder 

1 

Total  pipeline  stages 

n_cas+  (5  or  6) 

rucas :  Number  of  LUTs  forLUT  cascade. 


VI.  Experimental  Results 
A.  Number  of  Segments  and  Computation  Time  of  Algorithm 

Table  II  compares  the  number  of  segments  for  various  ap¬ 
proximation  methods  for  the  functions  in  [16].  In  this  table, 
Entropy,  Sigmoid,  and  Gaussian  are 

Entropy  =  — xlog2  x  —  ( 1  —  x)  log2  (1  —  x), 

1  1 

Sigmoid  =  - — ,  and  Gaussian  =  ; _ e  t 

1  +  e  4x  V27t 

In  Table  II,  the  columns  “Linear  Non”  show  the  number  of 
non-uniform  segments  for  linear  approximation  in  [17],  and 
the  columns  “2nd-Chebyshev  Uniform”  and  “2nd-Chebyshev 
Non”  show  the  number  of  uniform  segments  and  non-uniform 
segments  for  2nd-order  Chebyshev  approximation,  respec¬ 
tively.  The  columns  “Time”  show  the  CPU  time  for  our  non- 
uniform  segmentation  algorithm  applied  to  functions,  in  mil¬ 
liseconds. 

Table  II  shows  that,  for  many  functions,  the  2nd-order 
Chebyshev  approximations  require  many  fewer  segments  than 
the  linear  approximation.  However,  for  some  functions,  such 
as  y/  I  n  (x) ,  the  2nd-order  Chebyshev  approximation  based 
on  uniform  segmentation  requires  many  more  segments  than 


the  linear  and  2nd-order  Chebyshev  approximations  based  on 
non-uniform  segmentations.  Many  existing  polynomial  ap¬ 
proximation  methods  are  based  on  uniform  segmentation.  For 
trigonometric  and  exponential  functions,  approximation  meth¬ 
ods  based  on  uniform  segmentation  require  relatively  few 
segments.  However,  for  some  kinds  of  functions  such  as 
y/—  In  (A),  the  uniform  2nd-order  approximation  method  re¬ 
quires  excessively  many  segments.  On  the  other  hand,  our 
quadratic  approximation  based  on  non-uniform  segmentation 
requires  fewer  segments  for  a  wide  range  of  functions.  Also, 
Table  II  shows  that  the  CPU  time  is  strongly  correlated  to  the 
number  of  segments.  Smaller  acceptable  approximation  error 
(AAE)  requires  more  segments  and  longer  computation  time. 
However,  Table  II  shows  that,  for  all  functions  in  the  table, 
the  CPU  times  are  shorter  than  1  second  when  the  acceptable 
approximation  error  is  2  25 . 

These  results  show  that,  for  various  functions,  our  segmen¬ 
tation  algorithm  partitions  a  domain  into  fewer  non-uniform 
segments  quickly,  and  it  is  useful  for  automatic  synthesis. 

B.  Memory  Sizes  of  Various  NFGs 

This  section  compares  the  memory  sizes  of  our  NFGs  with 
three  existing  NFGs  [17,  3,  4],  Table  III  compares  NFGs  using 
linear  approximation  shown  in  [17].  This  linear  approxima¬ 
tion  is  based  on  non-uniform  segmentation.  In  Table  III,  the 
columns  “R”  show  the  following  values: 

^  memory  size  of  quadratic  approximation  ^  ^ 
memory  size  of  linear  approximation 

Table  III  shows  that  NFGs  using  quadratic  approximation  re¬ 
quire  much  smaller  memory  than  ones  using  linear  approxi¬ 
mation.  Especially,  24-bit  precision  NFGs  using  quadratic  ap¬ 
proximation  can  be  implemented  with  only  4%  of  the  memory 
size  needed  for  a  linear  approximation.  From  the  relation  be¬ 
tween  precision  and  memory  size  shown  in  Table  III,  we  can 
see  that  increasing  the  precision  decreases  the  ratio  of  memory 
sizes  in  NFGs. 
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TABLE  III 

Comparison  with  linear  approximation  based  on  non-uniform 


SEGMENTATION. 


Function 

/w 

16-bit  precision 

24-bit  precision 

Memop 

[bits! 

R 

r%i 

Memon 

rbitsl 

R 

m 

Linear 

Ouad. 

Linear 

Ouad. 

2X 

20992 

1112 

5 

696320 

19072 

3 

l/x 

21248 

2432 

11 

700416 

19136 

3 

V* 

43776 

5536 

13 

1425408 

86784 

6 

l/V* 

10176 

1104 

11 

343040 

19008 

6 

log2(x) 

20864 

2464 

12 

694272 

19072 

3 

lnf.v) 

20096 

2448 

12 

700416 

19136 

3 

sin(TO) 

19456 

2336 

12 

661504 

38656 

6 

cos(7D:) 

19584 

2336 

12 

663552 

38784 

6 

tan(7u:) 

19712 

2304 

12 

667648 

38272 

6 

s/~  ln(x) 

74240 

11264 

15 

2662400 

173056 

7 

tan2  (to)  + 1 

37632 

4960 

13 

1290240 

39040 

3 

Entropy 

106496 

10688 

10 

3768320 

83968 

2 

Sigmoid 

21120 

2432 

12 

702464 

40320 

6 

Gaussian 

4416 

444 

10 

156672 

8384 

5 

Average 

31415 

3704 

11 

1080905 

45906 

4 

Memory:  Memory  size.  Linear:  Linear  approximation  [17]. 

Quad.:  2nd-order  Chebyshev  approximation.  R\  Ratio. 


TABLE  IV 

Comparison  with  5th-order  approximation  based  on  uniform 

SEGMENTATION. 


Func. 

/M 

Domain 

Acc. 

Memory  size  fbitsl 

Ratio 

[%] 

5th-order 

(Uniform) 

Quad. 

(Non) 

sin  (to) 

[0,  1/4] 

2-li 

70528 

18048 

26 

exp  (a) 

[0,  1] 

2~24 

82432 

43136 

52 

2X  -  1 

[0,  1] 

2_  24 

89600 

19968 

22 

Acc.:  Accuracy. 

5th-order:  5th-order  approximation  [3]. 
Quad. :  2nd-order  Chebyshev  approximation. 


Table  IV  and  Table  V  compare  our  NFGs  with  NFGs  using 
5th-order  Taylor  expansion  [3]  and  NFGs  using  2nd-order  min¬ 
imax  approximation  by  the  Remez  algorithm  [4],  respectively. 
Both  approximations  in  [3,  4]  are  based  on  uniform  segmen¬ 
tation.  Thus,  their  NFGs  require  no  segment  index  encoder. 
On  the  other  hand,  since  our  approximation  is  based  on  non- 
uniform  segmentation,  the  memory  size  is  obtained  by  the  sum 
of  the  coefficients  table  and  the  segment  index  encoder.  As 
shown  in  [17]  and  Table  II,  for  trigonometric  and  exponential 
functions,  the  difference  of  the  number  of  uniform  segments 
and  non-uniform  segments  is  not  so  large  under  the  same  ap¬ 
proximation  polynomial.  For  such  functions,  NFGs  based  on 
uniform  segmentation  (needing  no  segment  index  encoder)  of¬ 
ten  require  smaller  memory  than  non-uniform  segmentations. 
Although  our  NFGs  require  the  segment  index  encoder  and 
use  approximation  polynomials  with  larger  approximation  er¬ 
ror  than  approximation  polynomials  in  [3,  4],  our  NFGs  for 
such  functions  are  implemented  with  only  22%  to  52%  of  the 


TABLE  V 

Comparison  with  quadratic  approximation  based  on  uniform 


segmentation. 


Func. 

Domain 

Acc. 

Memory  size  fbitsl 

Ratio 

/(*) 

Minimax 

(Uniform) 

Cheb. 

(Non) 

[%] 

sin(TO/4) 

[0,  1) 

2 -24 

16288 

19200 

118 

2X  -  1 

[0.  1) 

2~ 16 

2208 

2512 

114 

Minimax:  2nd-order  minimax  approximation  [4], 
Cheb. :  2nd-order  Chebyshev  approximation. 


memory  sizes  of  NFGs  in  [3],  and  with  memory  size  com¬ 
parable  to  [4],  In  [3,  4],  memory  sizes  of  NFGs  for  V/T  and 
\J  —  ln(x)  are  unavailable.  However,  from  Table  II,  we  can  see 
that  the  memory  size  of  their  NFGs  for  yT  and  \j  —  In  (A)  is 
excessively  large.  On  the  other  hand,  our  NFGs  can  realize  a 
wide  range  of  functions  with  small  memory  size. 

C.  FPGA  Implementation  Results 

Table  VI  compares  the  FPGA  implementation  results  of  our 
NFGs  with  NFGs  using  linear  approximation  [17]. 

Since  the  architecture  of  linear  NFG  is  simpler  than 
quadratic  NFG,  linear  NFGs  are  faster,  and  require  fewer  logic 
elements  and  DSP  units  than  quadratic  NFGs.  However,  lin¬ 
ear  approximation  requires  more  segments  and  larger  mem¬ 
ory  than  quadratic  approximation,  as  shown  in  Table  II  and 
Table  III.  Table  VI  shows  that  24-bit  precision  linear  NFGs 
cannot  realize  any  function  except  Gaussian  with  the  FPGA 
(the  smallest  device  in  the  Stratix  family)  due  to  the  excessive 
memory  size  although  many  logic  elements  and  DSP  units  are 
unused.  The  most  crucial  issue  in  the  FPGA  implementation 
is  the  constraints  on  these  hardware  resources.  For  24-bit  pre¬ 
cision,  the  linear  approximation  requires  a  larger  FPGA  due 
to  the  excessive  memory  size.  However,  in  the  larger  FPGA, 
more  logic  elements  and  DSP  units  are  left  unused  and  wasted. 
On  the  other  hand,  the  quadratic  NFGs  can  be  implemented 
with  a  smaller  FPGA  since  they  require  much  less  memory 
size  than  the  linear  NFGs  and  reasonable  sizes  of  logic  ele¬ 
ments  and  DSP  units.  In  fact,  24-bit  precision  quadratic  NFGs 
can  be  implemented  with  lower  cost  and  more  compact  FPGAs 
(Cyclone  II). 

VII.  Conclusion  and  Comments 

We  have  demonstrated  an  architecture  and  a  synthesis 
method  for  programmable  NFGs  for  trigonometric  functions, 
logarithm  functions,  square  root,  reciprocal,  etc.  Our  archi¬ 
tecture  can  efficiently  realize  any  non-uniform  segmentation 
using  a  compact  LUT  cascade,  and  approximate  many  numer¬ 
ical  functions  by  quadratic  polynomials.  Therefore,  our  archi¬ 
tecture  is  suitable  for  automatic  synthesis  of  fast  and  compact 
NFGs.  Implementation  results  on  an  FPGA  show  that  our  syn¬ 
thesis  method  can  approximate  a  wide  range  of  functions  with 
a  small  number  of  non-uniform  segments,  and  generate  NFGs 
with  small  memory  size.  For  24-bit  precision,  our  NFGs  can  be 
implemented  with  only  4%  of  the  memory  size  of  NFGs  based 
on  the  linear  approximation  with  non-uniform  segmentation, 
and  with  only  22%  of  the  memory  size  of  NFGs  based  on  the 
5th-order  approximation  with  uniform  segmentation.  NFGs 
based  on  the  linear  approximation  are  faster  than  the  quadratic 
ones,  but  for  high-precision,  they  require  a  large  FPGA  due  to 
the  excessive  memory  size.  On  the  other  hand,  our  quadratic 
NFGs  can  be  implemented  with  more  compact  and  low-cost 
FPGA  by  using  hardware  resources  on  the  FPGA  efficiently. 
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TABLE  VI 

FPGA  IMPLEMENTATION  OF  NFGS  FOR  LINEAR  AND  QUADRATIC  APPROXIMATIONS. 


FPGA  device:  Altera  Stratix  (EP1S10F484C5:  10570  logic  elements,  48  DSP  units) 

Logic  synthesis  tool:  Altera  QuartusII  5.0 

Synthesis  options: speed  optimization,  timing  requirement:  200MHz 


Function 

fix) 

16-bit  precision 

24-bit  precision 

Logic  elements 

DSP  units 

Freq.  TMHzl 

Logic  elements 

DSP  units 

Freq. 

TMHzl 

Linear 

Ouad. 

Linear 

Ouad. 

Linear 

Ouad. 

Linear 

Ouad. 

Linear 

Ouad. 

Linear 

Ouad. 

T 

167 

482 

2 

4 

195 

185 

604 

758 

2 

10 

- 

131 

l/x 

204 

376 

2 

4 

234 

186 

636 

859 

2 

10 

- 

134 

\fx 

270 

496 

2 

4 

237 

179 

1211 

822 

2 

16 

- 

124 

1/VT 

186 

475 

2 

4 

237 

186 

402 

753 

2 

10 

- 

131 

log2(x) 

163 

381 

2 

4 

194 

186 

597 

757 

2 

10 

- 

131 

lnf.v) 

170 

379 

2 

4 

197 

185 

416 

863 

2 

10 

- 

131 

sin  (tlx) 

154 

424 

2 

4 

197 

192 

480 

646 

8 

10 

- 

134 

cos(7d:) 

172 

354 

2 

4 

237 

179 

412 

647 

8 

10 

- 

131 

tan(rar) 

234 

382 

2 

4 

237 

178 

655 

604 

2 

10 

- 

131 

—  ln(-Jc) 

304 

623 

2 

10 

215 

135 

854 

942 

8 

16 

- 

130 

tan-(7uj  +  1 

132 

282 

2 

4 

194 

215 

991 

720 

2 

10 

- 

135 

Entropy 

141 

403 

2 

4 

235 

206 

1370 

914 

2 

16 

- 

128 

Sigmoid 

167 

430 

2 

4 

194 

191 

627 

706 

2 

10 

- 

131 

Gaussian 

181 

419 

2 

4 

237 

186 

303 

747 

2 

10 

216 

129 

Average 

189 

422 

2 

4 

217 

185 

683 

767 

3 

11 

- 

131 

Linear:  Linear  approximation  [17].  Quad.:  2nd-order  Chebyshev  approximation.  Freq.:  Operating  frequency. 

— :  NFGs  cannot  be  mapped  into  the  FPGA  due  to  the  excessive  memory  size. 

Memory  sizes  are  omitted  in  this  table  (see  Table  III). 
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